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ABSTRACT 

In silico prediction of genomic long non-coding RNAs 
(IncRNAs) is prerequisite to the construction and 
elucidation of non-coding regulatory network. 
Chromatin modifications marked by chromatin regu- 
lators are important epigenetic features, which can 
be captured by prevailing high-throughput 
approaches such as ChIP sequencing. We demon- 
strate that the accuracy of IncRNA predictions can 
be greatly improved when incorporating high- 
throughput chromatin modifications over mouse em- 
bryonic stem differentiation toward adult Cerebellum 
by logistic regression with LASSO regularization. The 
discriminating features include H3K9me3, H3K27ac, 
H3K4me1, open reading frames and several repeat 
elements. Importantly, chromatin information is 
suggested to be complementary to genomic 
sequence information, highlighting the importance 
of an integrated model. Applying integrated model, 
we obtain a list of putative IncRNAs based on 
uncharacterized fragments from transcriptome 
assembly. We demonstrate that the putative 
IncRNAs have regulatory roles in vicinity of known 
gene loci by expression and Gene Ontology enrich- 
ment analysis. We also show that the IncRNA expres- 
sion specificity can be efficiently modeled by the 
chromatin data with same developmental stage. 
The study not only supports the biological hypothesis 
that chromatin can regulate expression of tissue- 
specific or developmental stage-specific IncRNAs 
but also reveals the discriminating features between 
IncRNA and coding genes, which would guide further 
IncRNA identifications and characterizations. 



INTRODUCTION 

The ENCODE and related projects have revealed that the 
majority of eukaryotic transcripts are non-coding RNAs 
(1). Within the past few years, non-coding RNAs 
(ncRNAs) have attracted significant attention with 
regard to their unbelievably numerous biological roles, 
highlighting the biological significance of previously 'over- 
looked' RNA reservoir (2). Generally, long non-coding 
RNAs (IncRNAs) are ncRNAs that are longer than 
200 nt and are typically expressed in a developmental 
stage-specific manner (2). Other criteria have also been 
used such as open reading frame (ORF) size <300nt 
and high conservation for filtering interested IncRNAs 
(3,4). Most of IncRNAs have short ORFs based on con- 
ceptual translation and may not generate proteins (5). 
Similar to protein-coding genes, most IncRNAs are 
supposed to be transcribed by RNA polymerase II and 
have typical pre-mRNA-like structure including 5' Cap 
and polyA+ tail. IncRNA species can be divided into 
distinct categories basically including intergenic, sense, 
antisense, intronic and bidirectional transcripts (6). 
Previously, IncRNAs were considered to be transcrip- 
tional noises and experimental artifacts (7). However, 
IncRNAs are involved in a plethora of cellular processes 
including /raw-regulation of nearby protein-coding genes 
(8), imprinting control (9) and alternative splicing (10). 
For instance, HOTAIR, a HOX-associated IncRNA, 
which originates from the HOXC locus, can target 
PRC2 and silence the transcription of the HOXD locus 
in trans (11). Importantly, a few IncRNAs are involved in 
embryogenesis in vertebrates (12-15). 

Computational approaches are widely used to identify 
non-coding genes in previous studies. These approaches 
typically identify non-coding genes that have short ORFs 
and lack homology to protein-coding genes (16). It is still 
difficult to distinguish IncRNA genes with long ORFs from 
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long protein-coding genes. In addition, a large number of 
non-coding transcripts originate from protein-coding loci 
(17). To tackle the challenges, initial computational 
approaches based solely on genomic sequences are used 
to identify IncRNAs. ORF length (3), ORF conservation 
(18,19), structural approach (20) and artifact filtering (3,21) 
are the most common criteria used to distinguish IncRNAs 
from coding mRNAs (22). Machine-learning algorithms 
are one solution for non-coding RNA identification. Four 
recent representative tools, CPC (23), PORTRAIT (24), 
CONC (25) and iSeeRNA (26), use support vector 
machines to distinguish ncRNAs from coding mRNAs. 
These support vector machine-derived algorithms take 
multiple features such as peptide length, amino acid com- 
position, sequence conservation and sequence alignment in- 
formation into consideration, representing the pioneering 
methods for identifying IncRNAs from unknown genomic 
regions. Though successful in predicting structural RNAs 
(e.g. transfer RNA), these approaches do not perform well 
in non-structural IncRNAs, due partly to that non- 
structural RNAs are mRNA-like and less conserved 
among different species, which limits the applicability of 
sequence alignment information in IncRNA prediction. 
Recently, computational studies aiming at identifying 
non-structural IncRNAs are accumulating. In brief, they 
can be summarized by three strategies: cDNA 
sequencing-based filtering, chromatin modification 
mapping and RNA sequencing-based filtering. First, 
Boerner et al. (27) developed a computational pipeline for 
identifying IncRNAs from cDNA sequences. Though 
cDNA sequence is one source for locating IncRNAs in 
genome, the relatively higher costs hinder the extensive 
use. Second, Guttman et al. identified over 1000 putative 
IncRNAs in mouse genome based on H3K4me3 and 
H3K36me3 marks. Though successful, one potential limi- 
tation of their study is that IncRNAs are assumed to be 
regulated by same chromatin marks as protein-coding 
genes, which would underestimate the number of actual 
IncRNAs. Last, Sun et al. (28) proposed a computational 
tool for filtering IncRNAs from RNA sequencing (RNA- 
seq) data. Though RNA-seq-based transcriptome recon- 
struction is promising for de novo IncRNA identification, 
it suffers from sequencing precision and lagging algorithms 
for accurately building full-length transcripts due to low 
abundance of IncRNAs (29,30). However, the wide avail- 
ability of RNA-seq data provides a basis for identifying a 
large number of potential tissue-specific IncRNAs that can 
be further filtered. 

Though sequence-based methods achieve good perform- 
ance against golden-standard sequence sets, it is not prac- 
tical to derive tissue-specific expression information from 
them, making it inefficient to validate and analyze IncRNA 
function experimentally. Several studies have shown that 
chromatin modifications are helpful to increase genomic 
element prediction efficiency (31,32). IncRNAs rely on epi- 
genetic mechanisms to regulate cell differentiation and 
organ development (33), but little is known about the 
roles of epigenetic modifications in IncRNA transcriptional 
regulation. Owing to the chromatin immunoprecipitation 
followed by massively parallel sequencing (ChlP-seq) 
technique, which has been widely used to investigate 



genome-wide chromatin modifications in mammalian 
genomes (34,35), we are provided an opportunity to under- 
stand on a genome-wide scale how IncRNAs are regulated 
in a cell-type-specific manner based on tissue-specific or 
developmental stage-specific RNA-seq data and ChlP-seq 
data. Importantly, little is known about the chromatin 
modification and genomic features discriminating 
IncRNAs from protein-coding genes, which emphasizes 
the necessity to integrate chromatin features in different 
developmental stages and genomic information in a 
machine-learning model and evaluate their importance 
for distinguishing IncRNAs from protein-coding genes. 

To this end, we use 22 publicly available high-throughput 
mouse ChlP-seq data sets involving three developmental 
stages as well as 19 genomic features to identify features 
that discriminate IncRNAs from protein-coding genes. 
We use logistic regression with LASSO regularization 
to identify discriminating IncRNA signatures, evaluate 
model performance based on known gene annotations and 
finally predict IncRNAs from transcriptome-assembled 
transcribed fragments (transfrags). Logistic regression 
model, one of the most famous machine learning models, 
can integrate a large number of features and identify 
discriminating features/markers, which would tackle the po- 
tential problem that each single feature may not be com- 
pletely representative of IncRNAs. For example, although 
sequence conservation is usually used to identify regulatory 
non-coding genes, the degree of variation of conservation 
involving IncRNAs makes that high conservation feature 
can depict only a small subset of development related 
IncRNAs (36). Through the trained model, we evaluate 
the usefulness of chromatin modification and genomic 
features for IncRNA prediction. Specifically, LASSO regu- 
larization ranks the predictability of features for IncRNA 
classification with three transposable elements being the 
most predictive features followed by two enhancer-related 
histone modification marks H3K9me3 and H3K27ac in 
gene exons. In addition, the comparison of performance 
of model using chromatin information and/or sequence in- 
formation implies the ability of the integrated model to 
identify IncRNAs involved in embryonic development. 
Though chromatin modification features yield the perform- 
ance comparable with genomic features, the model 
integrating both types of information performs even better 
against the model with only one type. The IncRNA expres- 
sion specificity is contributed mostly by the chromatin 
modification features from same developmental stage, sug- 
gesting the trained integrated model is capable of predicting 
tissue-specific IncRNAs. A large pool of putative IncRNAs 
that exhibit comparable genomic property and chromatin 
profile with known IncRNAs is predicted from RNA-seq 
data. Furthermore, the coexpression of putative IncRNAs 
with nearby genes and enrichment in close to genes 
associated with neuron differentiation and transcriptional 
regulation functions suggest the potential regulatory func- 
tionality of IncRNAs. 

MATERIALS AND METHODS 

The model construction, graphs and statistical ana- 
lyses were done in R (http://www.r-project.org) and 
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Bioconductor (www.bioconductor.org) unless otherwise 
stated. 

Genomic features and chromatin data sets 

The chromatin modification data were taken from the 
mouse ENCODE project (35). Twenty-two publicly avail- 
able chromatin ChlP-seq raw data in mouse embryonic 
stem (ES) cell, Embryonic day 14.5 (El 4. 5) embryonic 
brain and adult 8-week Cerebellum (CB) were downloaded 
from Sequence Read Archive (37) including the following 
chromatin modifications: CCCTC-binding Factor 
(CTCF), polymerase II (PolII), H3K4mel, H3K4me3, 
H3K9me3, H3K27ac, H3K27me3 and H3K36me3 
(Table 1), in addition to Input. CpG islands information 
predicted by CpG_MI was obtained from Su et al. (38). 
Other genomic features including Phastcons most 
conserved regions and repeat elements were downloaded 
from UCSC (mouse mm9 build) (39). ORF-related 
features including ORF_proportion (ratio of ORF to tran- 
script length) and ORF_length were calculated by txCds 
Predict provided by UCSC. PhastCons most conserved 
regions and nucleotide sequences-based features were 
calculated by iSeeRNA program suite (26). 

Known IncRNAs for model training and testing 

RefSeq was generated by both automated and curated 
approaches to provide an up-to-date representation of 
gene sequences (40). Here, we used the lncRNA and 
long protein-coding genes from RefSeq for model 
training because the annotation typically had support by 
experimental validation. In RefSeq, annotated IncRNAs 
are considered unlikely to be protein-coding for several 
reasons, including non-sense-mediated decay, truncated 
ORFs and alternate splice variants with significant ORF 
truncation. In addition, RefSeq lncRNA annotations were 
also incorporated in the Havana and Ensembl gene 
building processes. 

We used Ensembl (41) transcripts for validating the 
proposed integrated model. The Ensembl lncRNA anno- 
tation was based on cDNA alignments and was guided by 



chromatin modifications (H3K4me3 and H3K36me3). 
Specifically, cDNAs with H3K4me3 and H3K36me3 
overlap were identified, followed by protein-coding poten- 
tial filtering. The procedure was similar to that performed 
by Guttman et al. (4). The candidate IncRNAs with 
maximum ORF covering >35% of its length and with 
PFAM/tigrfam protein domains were not considered as 
non-coding genes in Ensembl annotation. Other than elec- 
tronic annotation, the Ensembl annotation also included 
Havana manual annotation, suggesting that it was a 
reliable non-coding and coding annotation source. 

Sequencing data pre-processing 

Raw chromatin modification data (FASTQ format) were 
aligned to the mouse genome (mm9) by BWA software 
(vO.6.2) while suppressing alignments with >2 mismatches 
within a read. The aligned data were then normalized by a 
fixed read number (25 000 000). Exact duplicate tags were 
removed from each ChlP-seq data set to avoid PCR amp- 
lification biases introduced in the sequencing library prep- 
aration processes. Visual inspection suggested that the 
baseline read numbers were generally comparable for all 
normalized chromatin modification data, diminishing dif- 
ferences between different data. The aligned sequencing 
data were processed by MACS (42) to produce enriched 
chromatin domains. 

In lncRNA expression specificity analysis and the rela- 
tionships with nearby gene analysis, we also used RNA- 
seq data of the same developmental stages with ChlP-seq 
data, making it suitable for model construction and 
testing. Raw paired end RNA-seq FASTQ data were 
downloaded from GEO with the accession number 
GSE20851 for ES cell and GSE36025 for E14.5 embryonic 
brain and adult 8-week CB (43). Raw RNA-seq data were 
aligned to the mouse genome (mm9) by TopHat software 
(44), followed by gene expression quantification according 
to a widely accepted protocol (45). The expression levels of 
transcripts were quantified by Fragments Per Kilobase of 
transcript per Million mapped reads (45). 

To obtain assembled non-coding transcriptome data 
sets for genome-wide lncRNA predictions, we removed 



Table 1. The details of 227 features used in the integrated model 



Features 


Cell 
line 


Tissue/cell 
type 


Accession 


Feature number 


Data type 


Histone modifications (H3K4mel, H3K4me3, 


E14 


US 


GSE31039 


8, each with 7 subfeatures, summing up 


ChlP-seq 


H3K9me3, H3K36me3, H3K27ac, 








to 56 features 




H3K27me3), PolII and CTCF 












Histone modifications (H3K4mel, H3K4me3, 




E14.5 


GSE31039 


8, each with 7 subfeatures, summing up 


ChlP-seq 


H3K9me3, H3K36me3, H3K27ac, 




whole 




to 56 features 




H3K27me3), PolII and CTCF 




brain 








Histone modifications (H3K4mel, H3K4me3, 




8 week 


GSE31039 


6, each with 7 subfeatures, summing up 


ChlP-seq 


H3K27ac, H3K27me3), PolII and CTCF 




CB 




to 42 features 




Repeat elements and CpG islands 






UCSC mm9 


9, each with 7 subfeatures, summing up 
to 63 features 


Bed format 


PhastCons most conserved regions, ORF and 






UCSC mm9 


1 conservation feature, 2 ORF features, 


BED format 


nucleotide sequences-based features 








7 nucleotide sequences-based features, 
summing up to 10 features 





Columns represent the feature names, the cell lines of mouse ES cell used (cell Line), the tissue/cell type of the data (tissue/cell type), the NCBI GEO 
accession number (Accession), feature number and the data type. 
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the reads mapped on RefSeq and Ensembl genes. Further 
filtering included the removal of transfrags with short 
length (<400nt), lower expression (Fragments per 
kilobase of transcript per million mapped reads < 0.1), 
coding potential CPC score less than —1 by CPC (23), 
long ORFs with length >600nt (26). The remaining 
transfrags were predicted by our trained feature-selected 
model. The resulting putative IncRNAs for three devel- 
opmental stages were listed in Supplementary Table SI. 

Feature calculation 

The genomic (only repeat elements and CpG islands) and 
chromatin features of known and predicted transcripts 
were quantified for the following seven categories respect- 
ively: Transcription Start Sites (TSSs) upstream 3k, 
promoter (TSS until end of first exon), 1st third, 2nd 
third and 3rd third of gene body excluding 1st exon, re- 
spectively, exons in together and lastly Transcription End 
Sites downstream 3k. For transcripts with only one exon, 
the promoter was defined as 20% of gene length down- 
stream of TSSs. The IncRNA body was defined as the 
gene body without first exon. The rule makes seven 
subfeatures to be modeled for each feature (except 
10 sequence features), summing up to 227 features 
(Table 1). The full data for all features including 
subfeature information are in Supplementary Table S2. 
The general naming conventions of the subfeatures are 
tissue_feature_region, where region represents one of 
3000bp_u _TSS, 3000bp_d _TSS, 1_3 _gene, 2_3 _gene, 
3_3 _gene, exons and 3000bp_d_TTS, corresponding to 
the aforementioned feature categories, respectively. The 
subfeature calculations were done by writing custom 
JAVA scripts. The feature values for all genes used in 
the training and testing model were given in 
Supplementary Table S2. 

Selection of potential sequence features is one of the 
most important steps before modeling. We referred to 
Sun et a/.'s article (26) to select 10 effective sequence 
features that are listed in Table 1. In addition, the 
feature space of di- or tri-nucleotide sequence features 
is large, and these features are highly inter-correlated, 
which would decrease the performance of integrated 
model. We only used GC, CT, TAG, TGT, ACG and 
TCG in our model, which were shown to contribute 
mostly to the prediction performance (26). Though 
other features such as those based on homolog search 
were often used for non-coding RNA classification, 
they were highly correlated with the conservation and 
did not improve model performance (26). In addition, 
CpG island feature was used here because many 
intergenic CpG islands were associated with non-coding 
RNAs (46). Repeat elements were also reckoned to 
regulate IncRNA evolution (47), which were also 
included in our model. 

The gene expression specificity based on three develop- 
mental stages was used for analyzing IncRNA expression 
specificity-related chromatin and genomic features. We 
used QDMR software (48) to calculate expression specifi- 
city. Transcripts without developmental stage-specific 
expression fell into no specificity category. 



Feature selection by LASSO regularized logistic 
regression 

Standardization that can avoid shrinkage of feature 
weights was performed by subtracting the mean and 
dividing by the standard deviation for all features in 
integrated model. A binomial logistic regression was 
used to model sequence and chromatin features and then 
predict putative IncRNAs from transcriptomic data in ES 
cell, E14.5 brain and adult CB. Let y t = 0 or 1 to represent 
IncRNA or protein-coding gene. Define y = \y h y2,---, 
y„] T as the binary class label for all n genes. The probabil- 
ity of Vj = 1 is given by p f = Pr(y>j = 1), i = 1, . . ., n. The 
logistic regression model is defined by: 



logit{pi) = p 0 +Y] fay 



(1) 



7=1 



where /3 is the regression coefficient of variable x, which 
indicates how well each feature explains the difference 
of IncRNA and protein-coding genes. The logit(/?,) is 
defined by: 

PrO,- = 1) 



logitipi) = log 



(2) 



.1 - Pr()V = 1) 

To select efficient features, we used LASSO regularization 
that introduces an additional penalty with a power raised 
on the weight vector (49). Using LASSO regularization, 
feature weights of less significance would shrink to 0 as 
lambda increases. 

The LASSO estimate of fi is determined by 



olasso 



arg mm 

P 



7=1 



(3) 



7=1 



For the unbalanced size of RefSeq protein-coding and 
IncRNA genes, we split the coding genes into 10 parts, 
each of which together with all IncRNAs constituted a 
data set, respectively. Based on the lambdas that 
minimized the Mean Squared Error from cross-validation 
for each of the 10 data sets, we obtained a feature selected 
model for each data set, respectively. 

Model evaluation 

The model assessment measures included Area Under 
Curve (AUC) under Receiver Operating Characteristic 
(ROC) curves, Precision and Recall in 10-fold cross-valid- 
ation. The performance was averaged for 10 data sets. We 
calculated averaged AUC, Precision and Recall values for 
IncRNA and protein-coding gene predictions. We defined 
False Negative (FN) as the number of IncRNAs that 
were classified as coding RNAs in our predictions. The 
Positive Predictive Value (PPV), specificity and sensitivity 
were given as TP/(TP + FP), TN/(FP + TN) and TP/ 
(TP + FN), respectively. 

Gene Set Enrichment to elucidate IncRNA functions 

Gene Set Enrichment was done to explore IncRNA regu- 
latory functions. Briefly, we explored the function of genes 
closest to putative IncRNAs and systematically performed 
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Gene Ontology (GO) gene function enrichment of these 
genes using the DAVID system (50) followed by clustering 
of functions with significant numbers of shared genes 
using Enrichment map (51). Only Biological Process 
branches were involved in GO enrichment. 

RESULTS 

Specific features can discriminate long non-coding genes 
from coding genes 

To discriminate 24 761 long coding and 2383 long non- 
coding RefSeq gene regions, we used mouse ChlP-seq 
data sets in three developmental stages: (1) ES cells; (2) 
E14.5 brain and (3) CB, involving seven histone modifica- 
tions, one polymerase occupancy and one chromatin- 
associated protein (Table 1). In addition to these ChlP-seq 
data sets, genomic features including CpG islands, ORFs, 
repeat regions and PhastCons most conserved regions were 
also incorporated in our model. Chromatin features were 
tissue specific, which implied that the model can be applic- 
able to other cell/tissue contexts to predict other tissue- 
specific IncRNAs. Besides for model training, all of these 
data sets were also used for model validation and prediction 
purposes. Because of the unbalanced size of coding and 
IncRNA genes, the coding genes were randomly drawn 
into 10 gene sets with the whole IncRNA genes and 10 
non-overlapping coding gene sets of equal size. 

In our initial binomial model, we classified the training 
genes into two categories: coding and IncRNA genes. The 
regions of gene body plus 3-kb extended from transcript 
ends were split into seven parts (refer to 'Materials and 
Methods' section). Logistic regression model assumes that 
each feature contributes independently to the overall per- 
formance. However, the trained model would be biased if 
some features are partially or fully redundant. For 
example, some histone modifications like H3K4mel, 
H3K4me2 and H3K4me3 are likely to colocalize specific 
regions, making some features have no or little additional 



predictive power. To tackle this, we used a LASSO 
regularized binomial logistic regression model to 
evaluate features, by which we can rank the contributions 
of features to the classification of IncRNA and coding 
genes. LASSO regularization can shrink feature weights 
by introducing a lambda penalty factor to filter out unin- 
formative or redundant features. Top ranking features 
(non-zero weights and occurring at least in 7 of the 10 
models) with respect to specific log lambda values for 10 
training models were kept to build the feature selected 
model, whereas the feature weights with respect to log 
lambda were shown in Figure 1. As increasing the 
penalty parameter lambda, we observed that the weights 
of less informative features shrink to zero, whereas 
weights of informative features keep above-zero. By 
LASSO regularization, the most positively predictive 
features for coding genes under the log lambda values 
determined by cross-validation were ORF_proportion 
and ORF_length, which have been widely used for 
coding gene prediction in documented studies. It was 
not unexpected because long ORF is less likely to be 
observed in non-coding RNA sequences. H3K4me3_ 
3000bp_d_TSS (promoter) in ES and CB ranked third 
and fourth, respectively. H3K4me3 was previously 
shown to associate with active gene expression around 
TSSs (52), which is consistent with that IncRNAs are gen- 
erally lowly expressed. Surprisingly, features that best 
categorized IncRNA genes were three repeat elements, 
followed by ES_H3K9me3_exons and E14_H3K27ac_ 
exons, in agreement with IncRNA characteristics. Only 
one stage of the two chromatin modification data was 
selected by the model, which was reasoned by the 
average concatenated exon profiles generated by CEAS 
software (53) in Supplementary Figure SI. From 
Supplementary Figure S1A and C, we observed that 
H3K9me3 is more enriched in exon in ES cell than E14 
brain, whereas H3K27ac is more enriched in exon in E14 
brain than ES cell (Supplementary Figure SIB, D and E). 
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Figure 1. Selected informative features determined by the strict threshold using binomial logistic regression with LASSO regularization. Feature 
weights in predicting IncRNAs are with respect to log lambda, a penalty to shrink feature weights in the regression model. Weights of features with 
less discriminative power of the IncRNAs and protein-coding genes shrink to 0 as lambda is increasing. Informative features are those with above- 
zero weights based on lambda value determined by cross-validation. 
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Table 2. Comparison of average performance of 10-fold cross-valid- 
ation in RefSeq IncRNA and protein-coding gene predictions using 
all features, chromatin data only, sequence only and feature selected 
features 



Used features 


AUC 


Precision 


Recall 


All features 


0.927 


0.857 


0.857 


All chromatin features 


0.827 


0.760 


0.759 


All sequence features 


0.876 


0.797 


0.796 


Feature selected features 


0.882 


0.833 


0.832 



The comparisons include the AUC under ROC curves, precision and 
recall values. 



After comparing 10-fold cross-validation results, we 
found that the performance of integrated model decreased 
slightly after feature selection. AUC, Precision and Recall 
were used to assess models (refer to 'Materials and 
Methods' section), using 10-fold cross-validation 
(Table 2). Though chromatin data only model achieved 
relatively worse performance, integrated model with chro- 
matin information improved the IncRNA prediction. The 
logit of the full feature selected model was given as follows: 

logitf^O' = protein - coding)] = 4.001ORF_length 
+ 5.3330RF_proportion 

- 0.426E14TT3K4mel_exons 

+ 1.718ES_H3K4me3_3000bp_d_TSS 

- 0.473ES_H3K9me3_exons 

+ 0.304CB_H3K4me3_3000bp_d_TSS 

- 1.441E14_H3K27ac_exons 

+ 1.694E14_H3K4me3_3000bp_d_TSS 
+ 0.156E14_H3K36me3_exons 

- 2.792DNA_repeat_exons 

- 0.796LTR_repeat_3000bp_d_TSS 

- 5.824LINE_repeat_exons 

+ 0.023SINE_repeat_l_3_gene 

- 5.779LTR_repeat_exons 

+ 0.01 lSINE_repeat_3000bp_u_TSS 

- 0.066SINE_repeat_exons+ 0.601 

In our model, the size of bin upstream and downstream 
of two categories of genes that was used for feature quan- 
tification was arbitrarily defined as 3 kb. However, the bin 
size may not be optimal, as the chromatin signals are ex- 
pected to span thousands of genomic base pairs. A short bin 
size may only capture part of chromatin signatures, 
whereas a large bin size may include non-informative 
regions, which may degrade the model performance. To 
better characterize signatures of IncRNA and coding 
genes, the efficient bin size out of gene body should be 
optimized, as no explicit bin size was proved best previ- 
ously. After conducting independent experiments on 
eight bin sizes (1, 2, 3, 4, 5, 6, 8, 10 kb), we found the per- 
formance of cross-validations for the full feature model was 
best at 3kb, though 3, 4 and 5kb have comparably high 
performances (Supplementary Figure S2A). 



The gene body was partitioned into three parts (k = 3), 
as discretization into three parts is an usual strategy. To 
evaluate whether the partition was most effective for our 
model, we also explored other k values for partitioning. As 
a result, a partition with k = 2 yielded a moderate result, 
whereas partitions with k = 5 and k = 6 yielded extremely 
poor results (Supplementary Figure S2B). The partitions 
with k = 3 and k = 4 generated comparable cross- 
validation performance. To make our model simple, we 
preferred to use a partition with k = 3. 

The use of loose thresholds for adding more features does 
not significantly improve the performance of the 
integrated model 

We only kept overrepresented features with common oc- 
currence >6, which was termed as the strict threshold for 
10 training models. One would speculate that a model can 
achieve even better performance with additional feature 
inclusion. To defy this, we reconstructed three feature-se- 
lected models with loose thresholds of 6, 5 and 4 common 
features in 10 models, respectively. For the integrated 
model with more features benefited by loose thresholds 
while keeping other settings unchanged, we observed 
minor performance increments in the IncRNA and 
coding gene predictions (Supplementary Figure S3 A). 
For example, the integrated model achieved a precision 
of 0.803 using the strict threshold, whereas the precision 
smoothly increased to 0.813 with a looser threshold of 
4/10. Such a minor increase of performance suggested 
that feature selection with a stricter threshold was 
suitable to perform a genome-wide prediction, with 
respect to both time and result robustness. 

We were also interested whether removal of any one, 
two or more features from the feature selected integrated 
model at the strict threshold would harm the model per- 
formance. For the feature-selected integrated model, when 
removal of any one feature, two features or three features, 
while keeping other settings unchanged, on average, we 
observed only a weak reduction of the model performance 
for only one feature removal (Supplementary Figure S3B), 
whereas the performance was worse when more features 
were removed. The result further suggested the efficiency 
of feature selection and the robustness of the proposed 
feature selected integrated model. 

Independent gene set testing shows that the integrated 
model including chromatin data identities IncRNAs more 
accurately than sequence only model 

Having the data set of known Ensembl IncRNA and 
protein-coding genes without training, we conducted 
model performance testing with the strict threshold, 
which was shown to be effective compared with other 
thresholds. It was tempting to evaluate our models 
without sequence features together with a sequence only 
model because it was unclear if chromatin data could sig- 
nificantly improve IncRNA prediction power. A set of 
6578 Ensembl IncRNA genes without overlapping any 
RefSeq genes were used as the IncRNA positive testing 
set for evaluation. In contrast, a set of 1495 protein- 
coding genes without overlapping any RefSeq genes 
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Table 3. Comparison of averaged performance of testing validation 
for Ensembl IncRNA and protein-coding genes without overlapping 
any RefSeq genes using selected features, chromatin data only and 
sequence only features 



Used features 


Sensitivity 


Specificity 


PPV 


All chromatin features 


0.706 


0.649 


86.0% 


All sequence features 


0.764 


0.657 


87.2% 


Feature selected features 


0.753 


0.823 


92.8% 



The comparisons include the sensitivity, specificity and PPV. 



were used as the negative testing set. The testing set is an 
unbalanced data set, the size ratio of IncRNAs versus 
coding genes is ~4.4. A more proportion of IncRNAs 
in testing set mimicked the situation where most 
unannotated transfrags in assembled transcriptome data 
were expected to be non-coding fragments. Ten independ- 
ent tests for 10 training models were performed, respect- 
ively. The averaged results were shown in Table 3. From 
Table 3, the chromatin data only model and the sequence 
only model achieved a comparable accuracy on predicting 
IncRNAs (average PPV = 86.0% using chromatin data 
only features versus 87.2% using sequence only 
features). It was not surprising that the chromatin data 
only model did not outperform the sequence only model 
because chromatin modification data from only limited 
tissues/cell types were included in integrated model, and 
the performance may be better if more marks and more 
data from other tissues/cell types could be added. The 
integrated model with selected features achieved a better 
PPV, suggesting the usefulness of feature selection. Taken 
together, the sequence and chromatin information were 
somewhat complementary, though the sensitivity of chro- 
matin data only model was inferior to that of the sequence 
only model. 

Though more effective sequence features would help 
identify IncRNAs in theory, it seemed that features 
involving ORF were most useful, which were considered 
irreplaceable by chromatin data. Therefore, an integrated 
model incorporating both effective chromatin information 
and genomic sequence information could help identify 
potential IncRNAs in a more effective manner. 

IncRNA prediction based on transfrags of uncharacterized 
genomic regions by integrated model 

We then applied our feature selected integrated model to 
predict IncRNAs from transfrag sets de novo assembled 
from RNA-seq data of same developmental stages with 
chromatin ChlP-seq data. After pre-processing of RNA- 
seq data (details refer to 'Materials and Methods' section), 
only unannotated intergenic transcripts were kept to be 
predicted by our model. Furthermore, potential coding 
transcripts were filtered out by the CPC program, which 
used six features of putative ORFs to distinguish protein- 
coding from non-coding genes. In summary, we obtained 
19 246, 17 230 and 2688 IncRNAs in E14.5 brain, CB and 
ES cell, respectively. About 80% of these were develop- 
mental stage-specific IncRNAs. 



IncRNAs were short, non-conserved and lowly ex- 
pressed, compared with protein-coding genes in previous 
studies (4,54,55). To explore whether the putative 
IncRNAs filtered here also had similar genomic character- 
istics, we analyzed the gene structure, conservation level 
and ORF length of developmental stage- specific putative 
IncRNAs of three tissue/cell types (Figure 2). We found 
that the length of predicted putative IncRNAs was on 
average a half of that of known protein-coding transcripts 
(mean length of 1549nt for IncRNAs versus 2676 nt for 
coding transcripts) (Figure 2A). Though shorter than 
known protein-coding transcripts, the putative IncRNAs 
were comparable with known RefSeq IncRNAs in length 
(1549nt for IncRNAs versus 1899 nt for known 
IncRNAs). Moreover, IncRNAs had fewer exons per tran- 
script (~1.5) than protein-coding gene (~1 1.2), even fewer 
than known IncRNA transcripts (~5.2) (Figure 2B). 
Though the less number of exons might be an underesti- 
mation of the actual size of putative IncRNAs due to po- 
tentially incomplete assembly of lowly expressed 
transcripts, much evidence suggested many IncRNAs 
were tended to be unspliced, compared with protein- 
coding genes (56). In supporting this, the gene transcrip- 
tional rates were considered to be positively associated 
with splicing machinery (57). Consistent with prior 
studies (17,55), the putative IncRNAs were less conserved 
than known coding transcripts (Figure 2C). Notably, the 
putative IncRNAs were associated with shorter ORFs 
than known protein-coding transcripts (mean length of 
162.8 nt for IncRNAs versus 1747.0 nt for coding tran- 
scripts) (Figure 2D), while comparable with known 
IncRNAs (384.9 nt for known IncRNAs). 

Previous studies have implicated IncRNAs as potential 
products of enhancer functions. It has been suggested that 
enhancer elements can produce short transcripts called en- 
hancer RNAs (58). Enhancer RNAs may be related to 
IncRNAs because both of enhancers and IncRNAs are 
highly tissue/developmental stage-specific in gene expres- 
sion. For enhancer marks were associated with IncRNAs 
based on feature selection, we were interested in whether 
putative IncRNAs were associated with enhancer related 
chromatin marks. H3K27ac is known as an active 
enhancer mark, which is more useful for testing 
enhancer than H3K4mel that marks both active and 
poised enhancer (59). To estimate the proportion of 
enhancer associated putative IncRNAs, we intersected 
gene body and promoter of putative IncRNAs at three 
developmental stages with H3K27ac enriched domains 
of matched stage, respectively. We observed that ~20% 
of promoters of predicted IncRNAs in E14 and CB and 
~10% of promoters of predicted IncRNAs in E14 brain 
were associated with H3K27ac enriched domain 
(Supplementary Figure S4). In addition, ~50% of 
IncRNA body of predicted IncRNAs (exons excluding 
first exon, the same for following results) were associated 
with H3K27ac (Supplementary Figure S4). Taken 
together, a large proportion of IncRNAs seemed to be 
regulated by enhancer marks, whereas only a small pro- 
portion (~20%) of IncRNAs may be enhancer products. 
If only common IncRNAs expressed in the two develop- 
mental stages were considered, the proportion of 
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Figure 2. The genomic property of putative IncRNAs with developmental stage specificity, compared with known IncRNAs and protein-coding 
genes with developmental stage specificity. (A) Putative IncRNAs display shorter transcript length than that of known IncRNAs and known protein- 
coding genes. (B) Putative IncRNAs display fewer number of exons than that of known IncRNAs and known protein-coding genes. (C) Putative 
IncRNAs display lower PhastCons conservation scores than that of known IncRNAs and known protein-coding genes. (D) Putative IncRNAs display 
comparable ORF length with that of known IncRNAs and shorter ORF length than that of known protein-coding genes. 



association with the mark is low (~4% for promoter and 
~13% for body, detailed data not shown), which could be 
reasoned that IncRNAs were precisely and specifically 
regulated by this enhancer mark. 

As an example of enhancer mark regulated IncRNAs, we 
showed in Figure 3 that four IncRNAs in an imprinting 
cluster between Dlkl and Meg3 were candidates associated 
with enhancer, which were supported by histone modifica- 
tion patterns (H3K27ac and H3K4mel) and available lit- 
erature (Figure 3). Only basal levels of H3K4me3 and PolII 
were observed, in contrast to the enrichment of two 
enhancer marks H3K27ac and H3K4mel in the cluster of 
putative IncRNAs. Court et al. used 3C-qPCR approach to 
study the chromatin dynamics and long-range cw-inter- 
actions for several genomic regions. Based on their data, 
we found that these IncRNAs were in close to the contact 
loci with other chromosomes (60). Put together, the 
IncRNAs we found were reckoned to have roles in chro- 
matin dynamics over large genomic distances even other 
chromosomes. As another example, we also found an 
IncRNA located ~8kb downstream of Zfp386 and ~4kb 
upstream of Vipr2, as shown in Supplementary Figure S5. 

We then examined the reliability of lncRN A predictions 
by evaluating whether predicted IncRNA regions were sup- 
ported by PolII occupancy and Cap Analysis of Gene 
Expression (CAGE) clusters. First, we analyzed the pre- 
dicted developmental stage-specific IncRNAs by stage- 
matched PolII ChlP-seq data. RNA polymerase II binds 



the promoters of virtually all known protein-coding and 
non-coding genes (61,62). Though PolII feature was used 
to build the training model, it was not included in the 
feature selected model, consistent with its role as a 
general gene transcription indicator. The PolII distribu- 
tions aligned by the putative IncRNAs in El 4. 5 brain 
were shown in Figure 4A and B. The distributions in 
other tissues/cell lines were shown in Supplementary 
Figures S6 and S7. As expected, we found a peak at the 
vicinity of TSSs in E14.5 brain, CB and ES cell, respectively 
(Figure 4A, Supplementary Figures S6A and S7A). 
Quantitatively, ~50% of promoters (TSS upstream 3k 
until end of first exon) of predicted IncRNAs were 
associated with PolII (Supplementary Figure S8A). 
Altogether, the PolII patterns were consistent with known 
distribution of PolII (63). The IncRNA body was signifi- 
cantly enriched with PolII signals over basal levels 
(Figure 4B, Supplementary Figures S6B and S7B). 
Approximately 40% of IncRNA body of predicted 
IncRNAs was associated with PolII (Supplementary 
Figure S8B). It has been known that genes with low expres- 
sion have more tendency of PolII distribution toward TSSs 
and less in gene body compared with genes with high ex- 
pression (64), which implied that most IncRNAs we 
identified were lowly expressed exclusively in specific 
stages, considering the stage of PolII ChlP-seq data was 
exactly matched with that of predicted stage-specific 
IncRNAs in this analysis. 
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Figure 3. Novel putative enhancer related IncRNAs located between Dlkl and Meg3 in mouse genome, which are supported by histone modification 
patterns and known literature. These IncRNAs are ~35kb downstream of Dlkl and ~40kb upstream of MegS. Coverage plots of several histone 
modification ChlP-seq data in E14.5 brain are shown at the bottom. Data for each chromatin modification are shown as a 'wiggle' track of extended 
reads. Other genomic annotations derived from the UCSC Genome Browser database are also shown with the direction of transcription indicated by 
arrows. The visualization is based on a local mirror of the UCSC Genome Browser. Chromosome coordinates (mm9) are shown on top of this figure. 



CAGE was developed to map promoters, and we 
were interested whether the predicted IncRNA promoters 
were supported by CAGE tags, which would add extra 
evidence for supporting our predictions. The basic 
assumption of the evaluation was that the larger the 
number of CAGE tags overlapped with the predicted 
IncRNA promoters, the more confident the predictions 
were. Though only one CAGE tag was needed to map a 
promoter, IncRNAs would be reliable by multiple tags, due 
to the potential noise of CAGE, which would be better to 
be demonstrated in the similar manner with that of PolII. 



The CAGE data used here were taken from the 
FANTOM4 project with over 20 tissues/cell lines 
including brain tissues (65). Analogous to PolII profile, 
we also found a peak at the vicinity of TSSs in El 4, CB 
and ES cell, respectively (Figure 4C, Supplementary 
Figures S6C and S7C). Quantitatively, >80% of promoters 
of predicted IncRNAs were associated with CAGE tags 
(Supplementary Figure S8C). No enrichment of CAGE 
tags in gene body was consistent with the fact that CAGE 
was developed to map TSSs rather than other genomic 
regions (Figure 4D, Supplementary Figure S6D 
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Figure 4. The average profile of PolII ChlP-seq tags and CAGE tags around TSS and within gene body for stage-specific IncRNAs in E14.5 brain. 
(A) The TSS of IncRNAs is enriched with PolII tags over basal levels, where PolII density is aligned around TSS with ± 3000 bp extensions. 
The average signal represents the average number of reads per 100-bp interval. (B) The gene body of IncRNAs normalized by length of 3000 bp 
with 1000-bp extension from TSS toward upstream and TTS toward downstream is enriched with PolII tags over basal levels. (C) The TSS 
of IncRNAs is enriched with CAGE tags over basal levels, where CAGE tag density is aligned around TSS with ± 3000-bp extensions. (D) The 
gene body of IncRNAs normalized by length of 3000 bp with 1000-bp extension from TSS toward upstream and TTS toward downstream is enriched 
with CAGE tags over basal levels. The size of the gene body of all IncRNAs is scaled to 3000 bp for comparison (Meta-gene defined by the CEAS 
package). 



and S7D). In contrast, ~60% of IncRNA body of 
predicted IncRNAs was associated with CAGE tags 
(Supplementary Figure S8D). Altogether, the evalu- 
ation of putative IncRNAs by PolII occupancy and 
CAGE data suggested that the integrated model was 
powerful for identifying developmental stage-specific 
IncRNAs. 

IncRNAs are coexpressed with nearby genes and are 
associated with differentiation and development 

To explore the regulatory roles of the predicted IncRNA 
candidates, we assigned developmental stage-specific 
IncRNA candidates to their closest genes and compared 
the gene expression changes of IncRNAs and their closest 
genes during development, as previous studies raised a 



hypothesis that the regulation of IncRNAs on nearby 
protein-coding genes was a possible regulatory mechanism 
(Figure 5A), and almost 40% of GENCODE v7 IncRNAs 
were estimated to flank protein-coding gene loci (17). We 
showed the distance distributions between IncRNAs and 
nearby protein-coding genes in Supplementary Figure S9, 
where most of E14.5 brain-specific IncRNAs (~71.5%) 
had gene neighbors within lOOkb and ~18.3% within 
10 kb. Similarly, -78.5% of CB-specific IncRNAs had 
gene neighbors within 100 kb and ~ 19.9% had within 
10 kb. The detailed distance information for IncRNAs 
and neighboring genes was available in Supplementary 
Table S3. An IncRNA candidate may be difficult to be 
assigned to one specific nearby gene, as IncRNAs were 
often located in intergenic deserts and even may not 
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regulate closest genes in a linear genome. Here, we preferred 
to keep IncRNAs with closest genes <500kb, and we also 
discarded IncRNAs with closest genes <5 kb to avoid poten- 
tial untranslated region extensions from known genes. We 
were interested in the expression changes of the closest genes 
to tissue-specific IncRNAs (10013 CB-specific IncRNAs, 
12029 El 4. 5 brain-specific IncRNAs and 7217 common 
IncRNAs) during development. The expression data we 
used were stage matched with chromatin modification data 
in ES, E14.5 brain and CB, respectively. Here, an IncRNA or 
nearby protein-coding gene was defined as upregulated if the 
expression in E14.5 brain is <50% of that in CB, whereas it 
was downregulated if the expression in E14.5 brain is >2- 
fold of that in CB. In Figure 5B, most (71 %) of closest genes 
of CB-specific IncRNAs were upregulated in CB, and most 
(75%) of closest genes of E14.5-specific IncRNAs were 
downregulated in CB, consistent with the proposed hypoth- 
esis. In contrast, neighboring IncRNAs expressed both in 
El 4. 5 or CB were not tended to have such a pattern 
(Figure 5C). In addition, 36% of closest genes of common 
IncRNAs expressed in all developmental stages were 
upregulated or downregulated (data not shown). These 
findings suggested that the putative IncRNA candidates 
regulated transcription of nearby genes in brain develop- 
ment, in contrast to common IncRNAs and neighboring 
IncRNA-lncRNA pairs. The coexpression of IncRNAs 
and neighboring coding genes was supposed to be regulated 
by chromatin modifications, as the genomic information 
remained constant for all developmental stages. 

To further testify the regulatory roles of IncRNAs in 
mouse brain development, we performed GO function 
enrichment for the genes closest to IncRNAs using the 
DAVID system (50) followed by clustering of resulting 



function terms with significant numbers of shared genes 
using Enrichment map (51). Nearest coding genes of the 
E14.5 brain-specific IncRNAs were significantly enriched 
in GO terms such as neuron differentiation and cell 
morphogenesis involved in differentiation and 
axonogenesis (FDR = 6.2 x 10~ 13 and 1.9 x 10~ 9 , re- 
spectively) (Figure 6) and included many genes involved 
in brain functional regulation: Bdnf, Dbxl, Alkbhl, 
Neurod2 and Nrcam, suggesting IncRNAs were potentially 
associated with embryonic brain development. Although 
in adult CB, there were only eight enriched terms 
(g-value < 0.021), which were related to neuron differenti- 
ation, transcriptional regulation and synaptic transmis- 
sion (Figure 7). This result indicated that embryonic 
brain continued to develop, and neuron differentiation 
did not finish even after mouse birth, compared with 
adult mouse brain. Although for common IncRNAs, 
there were only three enriched GO terms (FDR < 0.001), 
less than that in E14.5 brain (41 terms) and in CB (five 
terms). The full enriched GO terms for E14.5 brain- 
specific IncRNAs, CB-specific IncRNAs and common 
IncRNAs were listed in Supplementary Table S4. Taken 
together, the assumed links between IncRNAs and func- 
tional genes in brain development sounded reasonable, 
which also suggested the regulatory roles of putative 
IncRNAs in stage-specific brain development. 

Furthermore, we also used the UP_TISSUE annotation 
from the DAVID system to explore whether known genes 
close to stage-specific IncRNAs tended to be tissue-specific 
genes. The UP_TISSUE list is a curated list of gene ex- 
pression specificity based on literature mining. Indeed, we 
found brain-related organ-specific genes are highly 
enriched in genes neighboring stage-specific IncRNAs. 
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Figure 6. GO enrichment analysis of genes close to developmental stage-specific lncRNAs in E14.5 brain. The GO enrichment is done from DAVID 
(FDR < 0.01) and is followed by clustering of resulting function terms with significant numbers of shared genes using Enrichment map. Dense gene 
functions are surrounded by circles with function terms labeled aside. Line thickness between connected nodes is proportional to gene numbers 
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Figure 7. GO enrichment analysis of genes close to developmental stage-specific lncRNAs in adult CB. Terms are related to Neuron differentiation. 
Transcriptional regulation and Synaptic transmission. 
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There were 12 enriched terms in El 4. 5 brain and 17 terms 
in CB after multiple testing adjustment (Supplementary 
Figure S10). Though top terms were shared by stage- 
specific IncRNAs and common IncRNAs, a few terms 
were developmental stage-specific, like Cortex in El 4. 5 
brain and Hippocampus in adult CB, suggesting 
IncRNAs were involved in stage-specific organ develop- 
ment. As supporting evidence, transcriptional regulation 
in cortex might change significantly before and after birth 
(66). These results suggested that IncRNA regulation may 
play an important role during a critical window of brain 
development. 

Developmental stage-specific IncRNAs could be efficiently 
modeled by chromatin modifications 

Most of IncRNAs exhibited more tissue-specific expres- 
sion patterns than protein-coding genes (55). With this 
in mind, we investigated whether IncRNA expression spe- 
cificity can be explained by the integrated model. 
Specifically, we investigated the developmental stage- 
specific expression of known RefSeq IncRNAs by 
replacing the class label of the initial integrated model 
with the expression specificity and then re-building the 
model. The expression specificity included ES specificity, 
El 4. 5 brain specificity, CB specificity and, lastly, no spe- 
cificity. In detail, there were 429 ES-specific, 738 E14- 
specific and 605 CB-specific IncRNA genes, whereas 
there were 7412 ES-specific, 6598 E14-specific and 5698 
CB-specific protein-coding genes. In addition, 611 
IncRNAs and 5053 protein-coding genes belonged to no 
specificity class. Unexpectedly, we did not observe 
tendency of IncRNAs to have specificity labels compared 
with coding genes, reflecting the global transcriptomic 
changes during mouse brain development. For the 
balanced number of four classes of expression specific 
IncRNAs, we directly applied LASSO regularized multi- 
nomial logistic regression model to evaluate features 
without any partitions. Here, we only analyzed the 
model of IncRNAs for simplicity. 



We also performed feature selection and desired to 
know whether chromatin features from a particular devel- 
opmental stage contributed more to IncRNAs with 
matched developmental stage specificity in integrated 
model. Indeed, we found that most of top chromatin 
features of IncRNAs specific at El 4. 5 stage were chroma- 
tin features in E14, like H3K36me3 and H3K4me3, rather 
than in ES or CB (Figure 8). Similar situation was also 
revealed at the other two stages (Supplementary Figure 
SI 1 ), while the stage-specific features were different 
from non-specific genes (Supplementary Figure SI 2). 
The observation was consistent with the hypothesis that 
chromatin modifications directed tissue-specific or devel- 
opmental stage-specific IncRNA gene expression. When 
chromatin modification features and genomic sequence 
features were both included, the integrated model had a 
much lower precision and AUC in 10-fold cross-validation 
for predicting IncRNA expression specificity (Table 4). In 
addition, genomic sequence information was shown to 
have little contribution to the integrated model perform- 
ance, as genomic information degraded the integrated 
model performance (Precision from 0.823 in the chroma- 
tin data only model to 0.794 in the integrated model). 
Altogether, stage-specific chromatin data were proved 
more useful to predict mouse developmental stage- 
specific IncRNAs. 



Table 4. Comparison of averaged performance of 10-fold cross-valid- 
ation in IncRNA expression specificity predictions using all features, 
chromatin data only, sequence only and selected features 



Used features 


AUC 


Precision 


Recall 


All features 


0.936 


0.794 


0.793 


All chromatin features 


0.955 


0.823 


0.823 


All sequence features 


0.653 


0.385 


0.392 


Feature selected features 


0.641 


0.674 


0.674 



The comparisons include the AUC under ROC curves, precision and 
recall values 




-8 -7 -6 -5 -4 -3 

Log Lambda 

Figure 8. Selected informative features of embryonic E14.5 brain expression specificity determined by the strict threshold using multinomial logistic 
regression with LASSO regularization. Feature weights in predicting expression specificity of embryonic E14.5 brain IncRNAs with respect to log 
lambda, a penalty to shrink feature weights in the regression model. Weights of features with less discriminative power of the IncRNAs expressed in 
embryonic E14.5 brain shrink to 0 as lambda is increasing. Informative features of embryonic E14.5 brain IncRNAs are those with above-zero 
weights based on lambda value determined by cross-validation. 
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DISCUSSION 

Here, we present a systematic study to identify and assess 
discriminating features for IncRNA identification over 
mouse ES cell differentiation to adult CB. We point out 
the importance of feature selection by LASSO regularized 
binomial logistic regression model. Selected chromatin 
modification features are explained from a chromatin 
biology point of view. We also highlight the importance 
of chromatin information in prediction of development 
related IncRNAs. 

Recent genome-wide transcriptomic maps have revealed 
a growing body of putative IncRNAs (28,54,55,67-69). It 
has been established that the expression of IncRNAs is 
strongly associated with vertebrate development 
(54,55,67,69). Cufflinks and Scripture are two earlier pi- 
oneering tools that can assemble and quantify transcrip- 
tome-wide coding and non-coding transcripts based 
directly on RNA-seq data (67,70). Other programs or 
pipelines can also filter IncRNAs de novo from RNA-seq 
data, of which Codon Substitution Frequency (71) and 
PhyloCSF (72) are most typical. Many mouse develop- 
ment studies used these tools to identify and filter novel 
IncRNAs based on high-throughput transcriptomic data 
(4,28,67,70). However, such alignment-based tools are not 
suitable to identify lineage-specific IncRNAs and are time- 
consuming. Recent attempts to mitigate this problem of 
RNA-seq based strategy used multiple chromatin modifi- 
cations (12,73). In this study, we suggest that chromatin 
modification data involving mouse brain development can 
improve IncRNA prediction, both revealed in 10-fold 
cross-validation and independent gene set evaluation. 
Nevertheless, the model we present here is preliminary 
and is expected to further improve IncRNA predictions 
by using other chromatin data, given that ChlP-seq data 
in public databases are rapidly accumulating. A researcher 
who is interested in IncRNAs in specific loci related to 
other development and differentiation processes should 
use other tissue/cell-related chromatin data, though rela- 
tively comprehensive chromatin data were only available 
for a handful of tissues/cells and species. 

Individual features from our model are associated with 
IncRNAs to different extents, which prompts us to use 
modeling approaches to discriminate IncRNAs from 
coding genes. Feature selection using chromatin features 
has been previously demonstrated to be efficient for pre- 
dicting genomic elements such as enhancer and transcrip- 
tion factor binding site (TFBS), based on available histone 
modification ChlP-seq or ChlP-chip data (31,74-76). 
Particularly, Narlikar et al. (77) also use LASSO regular- 
ization to identify heart-specific enhancers based on 
sequence features. Inspired by these studies, we use chro- 
matin features from ChlP-seq data as well as genomic 
sequence features to identify IncRNAs. Because chromatin 
marks are generally enriched in the vicinity of TSSs and 
are broadly distributed in genie regions, which may yield 
co-occurrence of chromatin features and is difficult to be 
handled by generalized linear model. LASSO regulariza- 
tion is expected to eliminate overfitting to training sets and 
reduce feature space; thus, LASSO regularization-based 
logistic regression model is useful to model such data. 



It is assumed in previous studies that coding and non- 
coding RNAs are under regulation by similar epigenetic 
modifications, like H3K4me 1/2/3, H3K27me3, H3K9me3 
and H3K36me3. Though epigenetic mechanisms of 
protein-coding gene regulation are relatively well- 
characterized previously, whether the transcription of 
IncRNAs itself is regulated by chromatin regulators and 
how difference between coding genes and long non-coding 
genes is not well understood. To tackle the circumstance, 
Sati et al. (78) compared chromatin modifications between 
genome-wide IncRNA and coding genes and observed that 
IncRNAs seemed to share same chromatin marks with 
coding genes, except DNA methylation and H3K9me3 
that did not seem to regulate IncRNAs. Later, Santoni 
(79) developed an algorithm to characterize given 
genomic elements based on ChlP-seq data. Based on the 
algorithm, H3R2mel mark was shown to be differentially 
distributed between IncRNA and coding genes. In the 
integrated model, 16 features are effective for distinguish- 
ing IncRNA from protein-coding genes. The top features 
predictive of IncRNAs are three repeat elements, followed 
by ES_H3K9me3_exons and E14_H3K27ac_exons, all in 
agreement with IncRNA characteristics. Transposable 
elements were previously shown to be associated with 
IncRNAs (80). It is supposed that IncRNA evolution is 
driven by transposable element insertions, which would 
partially explain the poor conservation of IncRNA tran- 
scripts. Once, H3K9me3 was not considered as a distin- 
guishing mark of IncRNAs; here, we show that H3K9me3 
is generally moderately related to IncRNAs. Compared 
with a prior study (78), we detected significant 
H3K9me3 differences in gene body rather than TSS 
proximal regions, which may explain why H3K9me3 had 
role in predicting IncRNAs but failed to be detected pre- 
viously. ORF-related features are shown to have import- 
ant roles in predicting coding genes, and ORF length is 
considered as the most efficient feature for distinguishing 
IncRNA from coding genes. 

There are two candidate explanations for the improved 
model performance and potential usefulness of chromatin 
information relative to DNA sequence-derived informa- 
tion. First, chromatin modifications are indicative of 
enhancer regions and other intergenic elements such as 
TFBS regions, as shown by recent high-throughput char- 
acterization studies in human (74,81). The explanation 
seems likely, given that available evidence suggests that 
some chromatin marks are indicative of enhancer. Here, 
we discover many instances where IncRNAs are marked 
by H3K4mel and/or H3K27ac. Therefore, the model may 
be biased toward enhancer elements. Though enhancer- 
related IncRNAs may be important for recruitment of 
chromatin-modifying enzymes (82), we also find a large 
number of transcripts marked by well-recognized tran- 
scription initiation marks, such as H3K4me3 and PolII, 
whereas a less number of transcripts marked by H3K4mel 
and/or H3K27ac, as exemplified by those IncRNAs in 
Supplementary Figure SI 3 and SI 4. In contrast to 
Figure 3, these IncRNAs have relatively weaker 
H3K27ac and H3K4mel marks, suggesting they are less 
likely to relate to enhancer function. Developmental spe- 
cificity prediction also suggests that other chromatin 
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features contribute more to specificity predictions than 
enhancer-related features. From Supplementary Figure 
S4, we estimate that the proportion of IncRNAs with 
H3K27ac occupancy is ~40%. Though the proportion 
of enhancer-related IncRNAs is difficult to determine, 
enhancer-related IncRNAs may not dominate the 
IncRNA reservoir. 

Second, the improved performance is more likely to be 
contributed by chromatin features, which is well supported 
in this article. Interestingly, some chromatin features 
such as H3K9me3 and H3K27ac are relatively higher 
distributed along gene body of IncRNAs at different devel- 
opmental stages, whereas the other chromatin features are 
more stage-specific (Figures 1 and 8 and Supplementary 
Figure Sll). The predictability of stage-specific IncRNAs 
is an evident merit, given that IncRNAs may only act 
in specific cell/tissue types. The simple strategy of 
incorporating chromatin information in model can 
improve IncRNA predictions and stage-specific predic- 
tions, which can be explained that chromatin data reflect 
directly the molecular activity that modulates IncRNA 
transcription, whereas sequence information is indirect 
and interwoven with a few confounding variables such as 
transcription factor binding. However, chromatin and 
sequence data are shown complementary, possibly due to 
the limited chromatin modification data in use. An 
integrated model involving multiple information sources 
is therefore useful for IncRNA predictions. Taken 
together, these observations suggest that chromatin modi- 
fications account for the performance improvement 
compared with sequence only model. Though the 
proposed model achieves satisfactory performance, 
IncRNA transcriptions may be regulated by different regu- 
latory mechanisms but are not within the central topic of 
this article, which can be separately analyzed by machine- 
learning tools in future studies. 

There are exceptions where chromatin information does 
not improve IncRNA predictions. The DNA-binding 
protein CTCF does not improve IncRNA predictions, in 
contrast to histone modification marks. We are not 
surprised because chromatin modifications are not 
closely related to this particular DNA-binding protein, 
in accordance with a previous study (83). Besides CTCF, 
PolII is also irrelevant to IncRNA prediction, which is 
consistent with that PolII can bind to both transcription- 
ally active and inactive gene promoters (84,85). Notably, 
though H3K4mel and H3K27ac are overrepresented for 
IncRNAs in the integrated model, it is useless for stage- 
specific IncRNA predictions. 

The integrated model is applied to transcriptome RNA- 
seq data and predicts many novel IncRNAs that are 
around imprinted genes. Many imprinted genes are them- 
selves non-coding RNAs, including Meg3 and Aim. Our 
predictions appear to reveal the regulation of nearby genes 
by IncRNAs, such as those putative IncRNAs upstream of 
Meg3 (Figure 3). 

Interestingly, we have also analyzed the enriched func- 
tions of known genes close to IncRNAs. In agreement with 
previous studies, genes associated with putative IncRNAs 
are enriched in GO terms related to neuron differentiation 
and transcriptional regulation. Additionally, the modified 



model for investigating developmental stage specificity 
suggests the potential roles of chromatin modifications 
in stage-specific non-coding RNA regulation. The regula- 
tion by distal IncRNAs on nearby genes perhaps allows 
gene expression to be fine tuned in a cell-type/developmen- 
tal stage-specific manner. Together, these findings suggest 
that IncRNAs tend to be involved in stage-specific tran- 
scriptional regulation. 

CONCLUSIONS 

We show here the logistic regression model with LASSO 
regularization can be used to predict IncRNAs using 
selected top characteristic histone modification and 
genomic features. We evaluate the integrated model by 
cross-validation together with independent data testing. 
Though more features are helpful for IncRNA predictions, 
the model performance is contributed mostly by top 
features. We compare the performance of model with 
only chromatin information and model with only 
genomic sequence information and show that they are 
both irreplaceable by each other, suggesting the useful- 
ness of an integrated model with both types of informa- 
tion. The observation suggests that the integrated model 
has an acceptable capability of learning complicated 
patterns from weak and complex chromatin and 
genomic patterns in IncRNAs. When applying the 
integrated model to assembled transcripts from RNA- 
seq data, we demonstrate the putative IncRNAs show 
strong tendency to close to genes associated with functions 
such as neuron differentiation and transcriptional regula- 
tion. For the tissue and developmental stage specificity of 
IncRNAs, more high-throughput chromatin data would 
be critical in deciphering non-coding RNA regulation, 
which would be encouraged by more availing high- 
throughput data. With more and more high-throughput 
chromatin modification data at hand, we envision that 
integrative modeling will facilitate more cell/tissue- 
specific IncRNA predictions. 
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